The Mathematical Framework
The core objective is to find a vector $x \in \mathbb{R}^n$ such that the linear combination $Ax = x_1a_1 + \dots + x_na_n$ best approximates $b$. This is frequently referred to as the regression of $b$ onto the regressors (the columns of $A$).
We focus on the residual vector $r = Ax - b$. In practice, we assume an overdetermined system where $m > n$: when $m = n$ and $A$ is non-singular, the optimal point is simply $A^{-1}b$, which drives the residual to zero and leaves nothing to optimize.
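To make the setup concrete, here is a minimal sketch in Python; the dimensions and random data are illustrative, not from the text.

```python
# Illustrative overdetermined system: m measurements, n unknowns, m > n.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))   # columns a_1, ..., a_n (the regressors)
b = rng.standard_normal(m)        # vector to approximate

x = rng.standard_normal(n)        # an arbitrary candidate point
r = A @ x - b                     # residual vector r = Ax - b
print(r.shape)                    # (100,): one residual per measurement
```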
Canonical Variations
Depending on the "flavor" of error we want to penalize, we choose different norms:
Least squares ($\ell_2$ norm). The most common approach. It minimizes the sum of the squares of the residuals, $\|Ax - b\|_2^2$. It is sensitive to large outliers but admits an analytical solution via the normal equations $A^T A x = A^T b$.
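As a sketch, reusing the illustrative data from above, the normal equations can be solved directly; `numpy.linalg.lstsq` is the numerically preferred route and should agree.

```python
# Least squares two ways: normal equations vs. numpy's built-in solver.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # QR/SVD-based solver
assert np.allclose(x_normal, x_lstsq)
print(np.sum((A @ x_normal - b) ** 2))            # minimized ||Ax - b||_2^2
```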
Chebyshev approximation ($\ell_\infty$ norm). Minimizes the maximum absolute residual, $\max_i |r_i|$. This is used when every single measurement must stay within a strict tolerance. It can be solved via the following Linear Program (LP), with variables $x \in \mathbb{R}^n$ and scalar $t$; a code sketch follows the LP:
minimize $t$
subject to $-t\mathbf{1} \preceq Ax - b \preceq t\mathbf{1}$
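A sketch of this LP with `scipy.optimize.linprog`, stacking $x$ and $t$ into one decision vector $z = (x, t)$; the data is again illustrative.

```python
# Chebyshev (ell-infinity) approximation as an LP: minimize t subject to
# -t*1 <= Ax - b <= t*1, with decision vector z = (x, t).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

c = np.r_[np.zeros(n), 1.0]                 # objective: minimize t
A_ub = np.block([[ A, -np.ones((m, 1))],    #  (Ax - b) <= t*1
                 [-A, -np.ones((m, 1))]])   # -(Ax - b) <= t*1
b_ub = np.r_[b, -b]
bounds = [(None, None)] * n + [(0, None)]   # x free, t >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x_cheb, t = res.x[:n], res.x[n]
print(t, np.max(np.abs(A @ x_cheb - b)))    # the two should agree
```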
$\ell_1$ approximation. Minimizes the sum of absolute residuals, $\sum_i |r_i|$. This approach is robust to outliers, since errors are not squared. It is also solvable via an LP, now with a vector of slack variables $t \in \mathbb{R}^m$ bounding each residual componentwise; a code sketch follows the LP:
minimize $\mathbf{1}^T t$
subject to $-t \preceq Ax - b \preceq t$
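The same pattern handles the $\ell_1$ LP, except $t$ is now a vector with one slack per residual; a sketch under the same illustrative data:

```python
# ell-1 approximation as an LP: minimize 1^T t subject to -t <= Ax - b <= t,
# with decision vector z = (x, t) and t in R^m.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

c = np.r_[np.zeros(n), np.ones(m)]          # objective: minimize 1^T t
I = np.eye(m)
A_ub = np.block([[ A, -I],                  #  (Ax - b) <= t
                 [-A, -I]])                 # -(Ax - b) <= t
b_ub = np.r_[b, -b]
bounds = [(None, None)] * n + [(0, None)] * m   # x free, t >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x_l1 = res.x[:n]
print(np.sum(np.abs(A @ x_l1 - b)))         # minimized ell-1 error
```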
Estimation Context
In many engineering fields, we assume a true state $x$ is obscured by noise: $y = Ax + v$. Our goal is to find an estimate $\hat{x} = \text{argmin}_z \|Az - y\|$. By choosing the norm, we are effectively making an assumption about the statistical distribution of the noise $v$.
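For example, if the entries of $v$ are i.i.d. Gaussian, $v_i \sim \mathcal{N}(0, \sigma^2)$, the log-likelihood of $y$ given $z$ is, up to constants, $-\|Az - y\|_2^2 / (2\sigma^2)$, so maximum likelihood reduces to least squares:

$$
\hat{x}_{\text{ML}} = \operatorname*{argmax}_z \, \log p(y \mid z) = \operatorname*{argmin}_z \, \|Az - y\|_2^2.
$$

Laplacian noise, with density proportional to $e^{-|v_i|/a}$, instead makes the $\ell_1$ minimizer the maximum-likelihood estimate, which is consistent with its robustness to heavy-tailed outliers.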